Gang termination fix #353
Conversation
shayasoolin left a comment
Great job!
```go
allErrs = append(allErrs, field.Forbidden(fldPath.Index(i).Child("terminationDelay"), "terminationDelay can only be set on PodCliqueScalingGroupConfig when PodCliqueSetTemplateSpec.terminationDelay is set (gang termination is enabled)"))
} else if scalingGroupConfig.TerminationDelay.Duration <= 0 {
	allErrs = append(allErrs, field.Invalid(fldPath.Index(i).Child("terminationDelay"), scalingGroupConfig.TerminationDelay, "terminationDelay must be greater than 0"))
}
```
and another else-if case, to validate that the PCSG-specific termination delay is not greater than the PCS one.
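For illustration, a minimal sketch of the full rule set with the suggested extra branch, pulled out as a standalone helper; the helper name and parameters are hypothetical, and only the `field.Forbidden`/`field.Invalid` calls and messages mirror the quoted validation:

```go
import (
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/validation/field"
)

// validateScalingGroupTerminationDelay is a hypothetical helper sketching the
// rules discussed here: a PCSG-level override requires a PCS-level delay,
// must be positive, and (per the suggestion above) must not exceed the
// PCS-level delay.
func validateScalingGroupTerminationDelay(pcsgDelay, pcsDelay *metav1.Duration, fldPath *field.Path) field.ErrorList {
	var allErrs field.ErrorList
	if pcsgDelay == nil {
		return allErrs
	}
	switch {
	case pcsDelay == nil:
		allErrs = append(allErrs, field.Forbidden(fldPath, "terminationDelay can only be set when PodCliqueSetTemplateSpec.terminationDelay is set (gang termination is enabled)"))
	case pcsgDelay.Duration <= 0:
		allErrs = append(allErrs, field.Invalid(fldPath, pcsgDelay, "terminationDelay must be greater than 0"))
	case pcsgDelay.Duration > pcsDelay.Duration:
		// the additional else-if case suggested in this thread
		allErrs = append(allErrs, field.Invalid(fldPath, pcsgDelay, "terminationDelay must not be greater than PodCliqueSetTemplateSpec.terminationDelay"))
	}
	return allErrs
}
```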
See https://docs.google.com/document/d/11RvFMS1l5RH_FY54G6wd1Q0wSJ2RCobiPRAs2vAhzgM/edit?disco=AAAByqL0jIQ. I discussed with @nvrohanv and @athreesh and while I don't see why someone would do this, we're not going to babysit this.
I agree with @shayasoolin. The PCSG termination delay should not be more than the PCS one. We need to provide an API where results are deterministic.
Yes, the reason I don't think we should babysit this is that we assume anyone who opts into gang termination knows what they are doing. In that scenario I can understand someone being more tolerant at the PCSG level: if one of 3 PCSGs is down, they might give it a longer amount of time to recover, whereas if the PCS minAvailable is breached they have less tolerance because the system is not functional. Since there's a valid use case for setting it higher, and we assume this is a power-user feature, I think we should provide full flexibility.
LGTM!
Good question, I've just updated the release notes section in the PR description with some guidance. I believe this will be included in the release notes when we cut a release? CC: @sanjaychatterjee.
Maybe? I'll take a stab at adding something. I believe @nvrohanv is working on a user guide PR; I might just wait for that to merge and then follow the style/approach there.
I will review it this week.
Force-pushed from b04bbb2 to 0c07d49.
```go
// TerminationDelay overrides the PodCliqueSet-level terminationDelay for this scaling group.
// Can only be set if PodCliqueSetTemplateSpec.TerminationDelay is set (gang termination is enabled).
// When set, this value is used instead of the PodCliqueSet-level terminationDelay for gang termination
// decisions affecting this scaling group's replicas.
```
Do we need validation that this value should be less than or equal to the PCS one?
```diff
 // for each PCSG (PCSG override if set, otherwise PCS default).
 // It returns the names of all such PodCliqueScalingGroups and minimum of all the waitDurations.
-func getMinAvailableBreachedPCSGInfo(pcsgs []grovecorev1alpha1.PodCliqueScalingGroup, terminationDelay time.Duration, since time.Time) ([]string, time.Duration) {
+func getMinAvailableBreachedPCSGInfoWithEffectiveDelay(pcsgs []grovecorev1alpha1.PodCliqueScalingGroup, pcs *grovecorev1alpha1.PodCliqueSet, since time.Time) ([]string, time.Duration) {
```
Can we add a unit test?
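For context, the behavior such a test would pin down is the effective-delay resolution. A hedged sketch of that lookup, where the helper name and field paths (`Spec.Template.TerminationDelay`, config-level `TerminationDelay`) are assumptions based on identifiers quoted in this PR:

```go
import (
	"time"

	grovecorev1alpha1 "github.com/NVIDIA/grove/operator/api/core/v1alpha1" // import path assumed
)

// effectiveTerminationDelay resolves the delay used for a given PCSG: the
// PCSG-level override when one is set on its matching config, otherwise the
// PCS-level default. Validation is assumed to have already rejected a nil
// PCS-level delay (gang termination disabled) before this is called.
func effectiveTerminationDelay(pcs *grovecorev1alpha1.PodCliqueSet, pcsg *grovecorev1alpha1.PodCliqueScalingGroup) time.Duration {
	if cfg := findMatchingPCSGConfig(pcs, pcsg); cfg != nil && cfg.TerminationDelay != nil {
		return cfg.TerminationDelay.Duration
	}
	return pcs.Spec.Template.TerminationDelay.Duration
}
```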
```go
// findMatchingPCSGConfig finds the PodCliqueScalingGroupConfig that matches the given PCSG.
// It returns nil if no matching config is found.
func findMatchingPCSGConfig(pcs *grovecorev1alpha1.PodCliqueSet, pcsg *grovecorev1alpha1.PodCliqueScalingGroup) *grovecorev1alpha1.PodCliqueScalingGroupConfig {
```
Can we add a unit test?
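A rough table-driven test sketch; matching configs by name and the `Spec.Template.PodCliqueScalingGroupConfigs` path are assumptions about the actual API types, so treat this as a starting point rather than the real test:

```go
import (
	"testing"

	grovecorev1alpha1 "github.com/NVIDIA/grove/operator/api/core/v1alpha1" // import path assumed
)

func TestFindMatchingPCSGConfig(t *testing.T) {
	pcs := &grovecorev1alpha1.PodCliqueSet{}
	// Assumed field path and match rule; the real spec layout may differ.
	pcs.Spec.Template.PodCliqueScalingGroupConfigs = []grovecorev1alpha1.PodCliqueScalingGroupConfig{
		{Name: "prefill"},
		{Name: "decode"},
	}
	tests := []struct {
		pcsgName string
		wantNil  bool
	}{
		{pcsgName: "decode", wantNil: false},
		{pcsgName: "router", wantNil: true}, // no matching config entry
	}
	for _, tc := range tests {
		pcsg := &grovecorev1alpha1.PodCliqueScalingGroup{}
		pcsg.Name = tc.pcsgName
		if got := findMatchingPCSGConfig(pcs, pcsg); (got == nil) != tc.wantNil {
			t.Errorf("findMatchingPCSGConfig(%q) = %v, wantNil = %v", tc.pcsgName, got, tc.wantNil)
		}
	}
}
```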
```go
// This field is used for gang termination: a PodClique can only be considered in breach of
// MinAvailable if it was previously available (WasAvailable=true).
// +kubebuilder:default=false
WasAvailable bool `json:"wasAvailable"`
```
Do we need this in the API? Wouldn't an update event tell us if the old version of the object was available but the new version is not?
I mentioned this in the "GREP":
Motivation for WasAvailable:
Ignore Initial Startup: New PodCliques should not be terminated simply because they haven't yet reached their MinAvailable threshold. The WasAvailable gate ensures gang termination only triggers for workloads that were previously healthy and then became degraded—not for workloads that are still starting up.
Operator Crash Resilience: The WasAvailable flag is persisted in the PodClique's status (not held in memory). This ensures that if the Grove operator crashes and restarts during the transition to availability, the flag is not lost. Because it's a sticky bit that never reverts to `false`, the operator can safely restart at any time without missing the transition or incorrectly treating a previously-available PodClique as never-available.
I'll go into a bit more detail in the doc, but we do need to persist this information for a few reasons:
• If the operator crashes and restarts, we can't reliably infer whether a PodClique was ever available just from the conditions. The UpdateInProgress reason doesn't tell us either way, so we'd be guessing.
• There's also a race condition where pods could briefly reach MinAvailable then die before we reconcile. Without the sticky bit, we'd lose that info and wrongly treat it as "never available."
I also considered just adding another condition, but decided against it: using a status field for internal bookkeeping (like observedGeneration) follows the Kubernetes pattern/convention, whereas conditions are more for user-facing state that users would alert on. So ultimately I went with the separate status boolean.
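For concreteness, a minimal sketch of how such a sticky bit would be maintained during status reconciliation; the helper name and parameters are hypothetical, and only `Status.WasAvailable` comes from this PR:

```go
import grovecorev1alpha1 "github.com/NVIDIA/grove/operator/api/core/v1alpha1" // import path assumed

// updateWasAvailable flips the sticky bit the first time the PodClique
// reaches MinAvailable and never clears it afterwards, so the fact survives
// operator restarts via the persisted status.
func updateWasAvailable(pclq *grovecorev1alpha1.PodClique, readyReplicas, minAvailable int32) {
	if !pclq.Status.WasAvailable && readyReplicas >= minAvailable {
		pclq.Status.WasAvailable = true // sticky: never reverts to false
	}
}
```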
Update - After going through the edge case below, I do see that if we respect the previous MinAvailableBreached=True on operator restart (and perhaps all the time) and don't require a "FLIP", it would solve this edge case. I'll revive the implementation and see if other additional edge cases pop up. That said, WasAvailable definitely has advantages in terms of simplicity and being explicitly observable.
For more background, here's the initial approach I tried and abandoned:
The PodClique controller watches "owned" Pods using a predicate. This triggers PodClique controller reconciliation when pods are deleted or have status changes (specifically when the Ready condition transitions). During this reconciliation, the "old" (current, but to-be-reconciled) PodClique object is retrieved from the Kubernetes API. We then reconcile and determine the new values for status fields like PodClique.Status.ReadyReplicas. At this point we essentially have an "old" and a "new" PodClique object, allowing us to notice changes in PodClique.Status.ReadyReplicas. Using this approach, we could avoid setting MinAvailableBreached to true unless oldPodClique.Status.ReadyReplicas >= minAvailable in the "old" object. However, this is insufficient given our requirements. Consider the following edge case:
Common Starting Point
| Time | Event | MinAvailableBreached Status | Reason | WasAvailable |
|---|---|---|---|---|
| T0 | PCLQ created, 0 pods | False | InsufficientScheduledPods | false |
| T1 | 3 pods scheduled, starting | False | InsufficientScheduledPods | false |
At T1, both scenarios are identical: 3 pods scheduled and starting.
Scenario A: Workload Becomes Available Then Degrades
| Time | Event | MinAvailableBreached Status | Reason | WasAvailable |
|---|---|---|---|---|
| T2 | All 3 pods become ready | False | SufficientReadyPods | true |
| T3 | 1 pod dies (2/3 ready) | True | InsufficientReadyPods | true |
| T4 | Operator crashes | - | - | - |
| T5 | Operator restarts | True | InsufficientReadyPods | ? |
Should gang terminate? YES - workload was healthy, now degraded.
Scenario B: Workload Never Becomes Available
| Time | Event | MinAvailableBreached Status | Reason | WasAvailable |
|---|---|---|---|---|
| T2 | 1 pod crashes during init | True | InsufficientReadyPods | false |
| T3 | Operator crashes | - | - | - |
| T4 | Operator restarts | True | InsufficientReadyPods | ? |
Should gang terminate? NO - workload never achieved availability.
State After Restart Comparison
| Field | Scenario A (T5) | Scenario B (T4) |
|---|---|---|
| Status | True | True |
| Reason | InsufficientReadyPods | InsufficientReadyPods |
| ScheduledReplicas | 3 | 3 |
| ReadyReplicas | 2 | 2 |
| Should gang terminate? | YES | NO |
The condition and status fields are identical. Without WasAvailable, the operator cannot distinguish between these scenarios.
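Illustratively, the restart-safe gate the comparison motivates might look like this (a sketch; the function name is hypothetical):

```go
import grovecorev1alpha1 "github.com/NVIDIA/grove/operator/api/core/v1alpha1" // import path assumed

// shouldGangTerminate only counts a MinAvailable breach when the PodClique
// was previously available; after an operator restart this is exactly what
// separates Scenario A (terminate) from Scenario B (still starting up).
func shouldGangTerminate(pclq *grovecorev1alpha1.PodClique, minAvailableBreached bool) bool {
	return minAvailableBreached && pclq.Status.WasAvailable
}
```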
Thanks for the release notes. Additionally, we need a GREP and usage docs as well.
Oh, I included gang-termination.md as a GREP and used the MNNVL doc as a template. I can take a look at the template PR and follow that layout, sure.
We do need the documentation, but I can take the action item of adding that once this is merged in. In general I think the flow should be
unmarshall left a comment
1/n reviews
```diff
@@ -0,0 +1,375 @@
+# Gang Termination Design Doc
```
The designs directory is going to be removed. Please rework this document using the GREP template instead, and put the document under proposals. Before writing a GREP, please also read docs/proposals/README.md.
Maybe you can just move this document out of this PR. Let's create a single document for gang scheduling and termination. This doc as it stands today needs a bit of work, else it will hold up your fix for no good reason.
> # Overview
>
> Gang termination is a mechanism in Grove that ensures when a component becomes unhealthy (falls below minimum availability threshold), the entire group (gang) is terminated rather than leaving it in a degraded state. This is particularly important for distributed AI inference workloads where partial availability often means the workload cannot function properly.
Can you define a component?
Ideally one should introduce what a Gang is, then talk about gang termination.
> Gang termination is a mechanism in Grove
This needs to change as well, as it is not a new concept introduced in Grove but a commonly known concept in gang-scheduled workloads.
> Gang termination is a mechanism in Grove that ensures when a component becomes unhealthy (falls below minimum availability threshold), the entire group (gang) is terminated rather than leaving it in a degraded state. This is particularly important for distributed AI inference workloads where partial availability often means the workload cannot function properly.
>
> ## Abbreviations
If we are going to have this section, then we will have to repeat it for every proposal that is written, which I think is quite unnecessary. Instead, create a separate document for commonly used abbreviations that all documents can refer to.
> | Abbreviation | Full Name | Description |
> |--------------|-----------|-------------|
> | PCS | PodCliqueSet | Grove CRD that manages a set of PodCliques and PodCliqueScalingGroups |
PCS does not manage PCLQ and PCSG; it is the controllers within Grove that manage these resources.
> | Abbreviation | Full Name | Description |
> |--------------|-----------|-------------|
> | PCS | PodCliqueSet | Grove CRD that manages a set of PodCliques and PodCliqueScalingGroups |
> | PCLQ | PodClique | Grove CRD representing a group of related pods |
PCLQ is a group of pods that share the same PodSpecTemplate and serve a single purpose.
> - **Configurable Grace Period:** Allow time for Kubernetes scheduler to recover before terminating
> - **Startup Protection:** Avoid terminating workloads that haven't yet reached full availability
> - **Rolling Update Safety:** Prevent false terminations during expected pod churn
> - **Multi-Level Granularity:** Support gang termination at PCLQ, PCSG, and PCS levels
What is gang termination at PCS level?
Gang termination at the PCS level is the whole PCS replica getting kicked back to scheduling.
> **Limitations:**
>
> - **Ready Pods Only (PCLQ level):** Only ready pods count toward availability—starting pods are not considered
how is this a limitation?
Agreed, this seems like a feature. Also, it's only once-ready pods, right?
> - **Ready Pods Only (PCLQ level):** Only ready pods count toward availability—starting pods are not considered
> - **Sticky WasAvailable:** Once a PodClique has reached MinAvailable, the WasAvailable flag never reverts, even if the workload is later degraded and recovers
> - **No Partial Recovery:** When gang termination triggers, the affected replica (PCS or PCSG) is deleted in its entirety, healthy PodCliques within that replica are not preserved.
how is this a limitation?
Also agreed; I believe this would only happen if minAvailable is breached, at which point we want the whole thing triggered.
> Individual `PodCliqueScalingGroupConfig` entries can override the PCS-level `terminationDelay`:
>
> - If PCS-level `terminationDelay` is nil, gang termination is disabled for the entire PCS
> - If PCS-level `terminationDelay` is set, each PCSG can optionally override it with its own delay
Why did we choose this behavior? If we have TerminationDelay at the PCS and PCSG levels, then it should be respected when defined and termination delay is enabled. The issue is that a nil value of the PCS termination delay has been taken as an indication of whether this feature is enabled or disabled. This IMHO is not so nice and also not very intuitive.
From an API design perspective, when we define a delay at multiple levels, thus allowing overriding, then if the PCS does not define TerminationDelay but it is defined at the PCSG level, it should be honored as long as this feature is enabled.
Hmm, it seems a bit odd to do it that way as well: since minAvailable defaults to 1 (if I remember correctly), it seems like you should be getting gang semantics throughout. If anything, I would lean towards validating that you can't set it on just a PCSG. Something we should discuss more.
For PCLQ, if minAvailable is not set then it is defaulted to the PCLQ template replicas. For PCSG it defaults to 1. Gang semantics (scheduling and termination) apply to identified pod-gangs. Currently there are only 2 types of PodGangs:

**Base PodGang**

Comprises of:

- `minAvailable` replicas of stand-alone PCLQs
- `minAvailable` replicas of PCSG. Within a PCSG, if a constituent PCLQ defines `minAvailable`, then only those many pods.

This means that only these many pods are required for a functional application. If the number goes below that and stays like that for terminationDelay seconds, then it's time to terminate the gang.

**Scaled PodGang**

Similar behavior applies to the scaled PodGang, where only `minAvailable` of constituent PCLQs are considered to determine if the minAvailableBreached condition is true.
Now what happens when you do not define TerminationDelay? I believe we then need to have a default (very large) termination delay.
Consider the following PCS composition:
- Standalone PCLQ - router
- PCSG - comprises of decode & prefill PCLQs
Case #1
You define a terminationDelay of 1hr on PCS and 2hr on PCSG.
Now what does it mean for the base PodGang?
If router has minAvailableBreached set to true and it has already exceeded 1hr then it will gang terminate the base PodGang. A higher termination delay on PCSG would not come into play here.
What is the behavior for the scaled PodGang?
This is relatively simple. PCS termination delay will not come into effect. For all Scaled podgangs termination delay defined at the PCSG will only be considered.
Case #2
You define a terminationDelay of 2hr on PCS and 1hr on PCSG.
Now what does it mean for the base PodGang?
For base pod gang only PCS termination delay will apply. So if the PCSG PCLQs have their minAvailableBreached condition set for more than 1hr but less than 2hr, there will not be any gang termination. From the API perspective this behavior is utterly confusing.
So as soon as we introduce 2 level termination delay we should consider the behavior carefully.
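To make the described semantics concrete, a sketch of the delay selection per PodGang type as laid out in the two cases above (names are illustrative, not actual Grove code):

```go
import "time"

// terminationDelayFor encodes the behavior described above: the base PodGang
// is governed solely by the PCS-level delay, while each scaled PodGang
// considers only the delay defined on its PCSG.
func terminationDelayFor(isBasePodGang bool, pcsDelay, pcsgDelay time.Duration) time.Duration {
	if isBasePodGang {
		return pcsDelay // a higher or lower PCSG delay does not come into play here
	}
	return pcsgDelay
}
```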
unmarshall left a comment
2/n review comments
Co-authored-by: Madhav Bhargava <madhav.bhargava@sap.com>
Signed-off-by: Geoff Flarity <geoff.flarity@gmail.com>
What type of PR is this?
/kind feature
/kind bug
/kind api
What this PR does / why we need it:
This PR fixes and improves gang termination behavior in Grove with several key changes:

1. **Gang termination is now opt-in** - `terminationDelay` no longer defaults to 4 hours. When nil, gang termination is disabled entirely for the PodCliqueSet.
2. **Only ready pods count for breach detection** - Previously, breach calculation considered scheduled pods; now it only counts ready pods, which more accurately reflects actual workload availability.
3. **PCSG-level terminationDelay override** - Individual PodCliqueScalingGroups can now override the PCS-level termination delay, allowing finer-grained control over gang termination timing.
4. **Bug fix: nil pointer when base gang doesn't exist** - Fixed a crash when the base gang no longer exists after gang termination.
Which issue(s) this PR fixes:
Fixes #277
Special notes for your reviewer:
- `docs/designs/gang-termination.md` provides a high-level overview of the gang termination feature after these changes. Happy to update the code alongside any design doc feedback.
- PCSG-level `terminationDelay` can only be set when the PCS-level `terminationDelay` is set.

Does this PR introduce an API change?
Additional documentation e.g., enhancement proposals, usage docs, etc.: